* DataSource.ID is now a URI rather than a String.

Basically a good idea, especially in the context of context ;)

* Removed DataSource.getConfiguration, .initConfiguration and
.checkConfiguration for now.

I'm still not sure whether these methods are a good idea as they restrict
configuration data to simple key-value pairs. I'm thinking for example about
giving FileSystemDataSource a *set* of root directories, with for each
directory *separate* *sets* of include and exclude dirs (we've seen multiple
use cases where this would have been very handy). That's 3 things that are
hard to configure in a key-value based API.

Right now I'm going for dedicated configuration methods in each DataSource. We
can always later add such generic configuration methods that internally parse
the configuration data and invoke the specialized methods, although I suspect
this will be without the full configuration possibilities as provided by the
specialized methods. I would be fine with that though, as we always build
dedicated UIs so these additional methods would do us no harm. I'd like to
postpone this decision for now though until we know more about the pros and
cons of this approach.
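
To make the FileSystemDataSource example concrete, here is a minimal sketch of
what such dedicated configuration methods could look like. All names in it
(RootDirectoryConfig, addRootDirectory, include, exclude, getConfig) are made
up for illustration, they are not part of the current API:

```java
import java.io.File;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical: the include and exclude sets that belong to ONE root directory.
class RootDirectoryConfig {
    private final Set<File> includes = new LinkedHashSet<>();
    private final Set<File> excludes = new LinkedHashSet<>();

    RootDirectoryConfig include(File dir) { includes.add(dir); return this; }
    RootDirectoryConfig exclude(File dir) { excludes.add(dir); return this; }

    Set<File> getIncludes() { return Collections.unmodifiableSet(includes); }
    Set<File> getExcludes() { return Collections.unmodifiableSet(excludes); }
}

// Hypothetical dedicated configuration methods on the data source itself.
class FileSystemDataSource {
    private final Map<File, RootDirectoryConfig> roots = new LinkedHashMap<>();

    // Registers a root directory (only once) and returns its own configuration.
    RootDirectoryConfig addRootDirectory(File root) {
        return roots.computeIfAbsent(root, r -> new RootDirectoryConfig());
    }

    Set<File> getRootDirectories() {
        return Collections.unmodifiableSet(roots.keySet());
    }

    // Returns null for a directory that was never added as a root.
    RootDirectoryConfig getConfig(File root) { return roots.get(root); }
}
```

A set of roots, each with its own include and exclude sets, falls out
naturally here, whereas a flat key-value API would need some naming or
encoding scheme to express the same nesting.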

* Created an ...aperture.model package.

This package holds the DataSource-related interfaces and the DataObject
hierarchy. I've decided to keep them apart from the rest, as they are used in
all other parts of the framework except the Extractor part and, to my mind, do
not "belong" to any one of them more than to the others.

DataObject has two subtypes: BinaryObject and Folder. The first adds a
getContent method delivering the InputStream, the second has no extra methods.
All other methods have become part of the metadata, i.e. there is no
getContentType, getChildren, etc.

FYI: the MIME type is now also part of the metadata. However, the metadata
only holds a MIME type if the data source reported one, i.e. it does NOT hold
the result of a MimeTypeIdentifier. IMO whether to run one should always be an
application design decision. For example, if you implement a wget-like
crawler, you don't want all this processing stuff to take place and you might
even specifically be interested in what the source returns, rather than what
some smart (cl)ass makes of it ;) The MIME type determined by the
MimeTypeIdentifier may well be *stored* in the DataObject, but this is up to
the containing system to decide. FYI2: the InfoSource I intend to make will do
this, so people using the entire Aperture framework need not worry about it.
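
A rough sketch of the hierarchy described above. Only the type names and
getContent come from the actual design. The getID method, the use of a plain
Map for the metadata, the "mimeType" key and the InMemoryBinaryObject class
are assumptions made to keep the example self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

interface DataObject {
    URI getID();                       // assumption: objects are identified by a URI
    Map<String, Object> getMetadata(); // assumption: a Map stands in for the metadata model
}

// Adds access to the content stream; everything else lives in the metadata.
interface BinaryObject extends DataObject {
    InputStream getContent() throws IOException;
}

// No extra methods: sub and super folders are described in the metadata.
interface Folder extends DataObject { }

// Hypothetical implementation, just enough to show the MIME type rule.
class InMemoryBinaryObject implements BinaryObject {
    private final URI id;
    private final byte[] bytes;
    private final Map<String, Object> metadata = new HashMap<>();

    // reportedMimeType is what the source itself claimed, or null if it
    // claimed nothing. No MimeTypeIdentifier is consulted here.
    InMemoryBinaryObject(URI id, byte[] bytes, String reportedMimeType) {
        this.id = id;
        this.bytes = bytes.clone();
        if (reportedMimeType != null) {
            metadata.put("mimeType", reportedMimeType); // hypothetical key
        }
    }

    public URI getID() { return id; }
    public Map<String, Object> getMetadata() { return metadata; }
    public InputStream getContent() { return new ByteArrayInputStream(bytes); }
}
```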

* Naming Changes

e.g. Crawler instead of DataCrawler, CrawlerListener instead of
DataCrawlerListener. This already makes my code easier to read.

Also, some classes may have changed package, but I've lost track of that.

* DataAccessor now gets a Date as parameter rather than an
AccessData/CrawlData.

In all our use cases a Date is sufficient. Furthermore this simplifies a lot
of things:
- DataAccessor implementors need not learn another API (CrawlData)
- This also eases documentation considerably
- CrawlData can be hidden inside the abstract CrawlerBase class, it does not
  even need to live at the Crawler level.
- Only this class handles what's stored in and retrieved from this object. In
  the old setup both the crawler and the accessor read and changed data in it,
  making it possible for one to screw up for the other.

In cases where the DataAccessor needs more information, there is probably
already a strong connection between the crawler and the accessor and you can
specify the additional params in the Map or even combine the Crawler and
DataAccessor implementation in a single class that implements both interfaces.
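
As an illustration of the Date-based approach, a minimal sketch. The method
name getIfModifiedSince, its exact signature, the String return type (instead
of a DataObject) and the InMemoryAccessor class are assumptions made for this
example only:

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

interface DataAccessor {
    // Returns the content, or null when the resource is unknown or has not
    // changed since previousAccess. A null previousAccess means "never seen".
    // Extra, implementation-specific parameters can travel in params.
    String getIfModifiedSince(String url, Date previousAccess, Map<String, Object> params);
}

// Hypothetical accessor backed by two maps instead of a real data source.
class InMemoryAccessor implements DataAccessor {
    private final Map<String, String> content = new HashMap<>();
    private final Map<String, Date> lastModified = new HashMap<>();

    void put(String url, String text, Date modified) {
        content.put(url, text);
        lastModified.put(url, modified);
    }

    public String getIfModifiedSince(String url, Date previousAccess, Map<String, Object> params) {
        Date modified = lastModified.get(url);
        if (modified == null) {
            return null; // unknown resource
        }
        if (previousAccess != null && !modified.after(previousAccess)) {
            return null; // unchanged since the previous access
        }
        return content.get(url);
    }
}
```

The accessor only ever sees a Date, so whatever bookkeeping produced that
Date (CrawlData or otherwise) can stay hidden inside the crawler.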

* Removed HierarchicalAccess

I believe it is redundant. Redundancy is not a problem if it makes life
easier, but I don't even think it does that ;) For example, the root folders
can be obtained from the DataSource wrapped by the HierarchicalAccess (although it
does not define generic methods for that at the DataSource level). The
DataObjects can be directly retrieved from the DataAccessor, the
HierarchicalAccess would only be delegating calls to it. Finally, information
about super- and subfolders will be part of the metadata of the DataObject,
e.g. a BinaryObject's metadata will have some partOf/containedIn/whatever
property, a Folder's metadata will hold sub and super folders metadata.

* Removed DataFactory.

Its design (one factory for DataSource, DataCrawler, etc.) makes the incorrect
assumption that for a given DataSource implementation the DataCrawler
implementation as well as the implementations of all other interfaces
mentioned here are fixed. In our use cases this is typically not the case.

I propose the following factory approach, with which we have had very good
experiences in another OSGi-based system. Warning: long explanation ahead ;)

Each XYZ API interface comes with its own XYZFactory interface whose get()
method embeds the knowledge of how an instance of this type is best
instantiated. For example, it may always return the same statically held
instance, return a new instance on each get() call, return shared instances
that are temporarily cached using WeakReferences, etc.

Examples: a PlainTextExtractorFactory always returns the same
PlainTextExtractor instance, as it is stateless. A DataCrawlerFactory will
usually create a new instance, except when the implementation is stateless.
The MagicMimeTypeIdentifierFactory returns a shared instance that is cached
using a WeakReference, as (1) its constructor does some costly initialization,
(2) the instance consumes a significant amount of memory and (3) the identify
method defined in the MimeTypeIdentifier interface does not alter its state.
In other words: you want to keep the instance around as long as it's used but
also get rid of it when you're done.
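
The WeakReference strategy could be sketched roughly as follows. The identify
signature and the toy "%PDF" check are assumptions made to keep the example
small, the real MimeTypeIdentifier interface may look different:

```java
import java.lang.ref.WeakReference;

interface MimeTypeIdentifier {
    String identify(byte[] firstBytes); // assumed signature; null if unknown
}

interface MimeTypeIdentifierFactory {
    MimeTypeIdentifier get();
}

class MagicMimeTypeIdentifier implements MimeTypeIdentifier {
    MagicMimeTypeIdentifier() {
        // Imagine costly loading of magic number tables here.
    }

    // Toy rule standing in for real magic number matching; does not alter state.
    public String identify(byte[] b) {
        if (b.length >= 4 && b[0] == '%' && b[1] == 'P' && b[2] == 'D' && b[3] == 'F') {
            return "application/pdf";
        }
        return null;
    }
}

class MagicMimeTypeIdentifierFactory implements MimeTypeIdentifierFactory {
    // Weakly held: the garbage collector may reclaim the instance once no
    // caller holds a strong reference to it anymore.
    private WeakReference<MimeTypeIdentifier> cache;

    public synchronized MimeTypeIdentifier get() {
        MimeTypeIdentifier instance = (cache == null) ? null : cache.get();
        if (instance == null) {
            instance = new MagicMimeTypeIdentifier();
            cache = new WeakReference<>(instance);
        }
        return instance;
    }
}
```

As long as some caller still holds the identifier, every get() call returns
that same instance. Once all callers have dropped it, it becomes eligible for
garbage collection and a later get() call creates a fresh one.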

In some cases the method will be called newInstance() rather than get(),
namely when from an architectural perspective it is vital that a new instance
is returned. This is
typically the case for objects that will be configured after being returned by
the factory, e.g. DataSources. For other cases (e.g. DataCrawlers) it will not
matter whether you get a unique instance or not and the decision is then best
left to the XYZFactory implementation. This is expressed by the more neutral
get() method, which makes no assumptions on this matter. If there is ever a
case when it is vital that the instance is shared (haven't encountered one
yet), I would propose a sharedInstance() method.
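
A small sketch of the newInstance() contract. The getName/setName pair and the
Simple* classes are hypothetical, they only serve to show why a shared
instance would be wrong for objects that get configured afterwards:

```java
interface DataSource {
    String getName();
    void setName(String name); // hypothetical configuration method
}

interface DataSourceFactory {
    // Contract: every call returns a fresh, unconfigured instance.
    DataSource newInstance();
}

class SimpleDataSource implements DataSource {
    private String name;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

class SimpleDataSourceFactory implements DataSourceFactory {
    public DataSource newInstance() {
        return new SimpleDataSource();
    }
}
```

If the factory handed out a shared instance here, configuring one data source
would silently reconfigure all the others.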

XYZ implementations are provided as separate OSGi bundles (i.e. separate from
the bundle that provides the XYZ interface itself). An implementation bundle
contains an implementation for both the XYZ and the XYZFactory interface. The
BundleActivator of this bundle should announce that the factory implementation
is an implementation of XYZFactory.

As said above, the XYZ interface is part of a separate bundle that only
provides this API to the system. This bundle has no BundleActivator as it does
not register a service, it only provides a service API.

Besides the XYZ interface itself, this bundle also contains an XYZRegistry.
The job of a registry is to keep track of all the XYZFactory implementations
that are announced to the OSGi platform. Every time the BundleActivator of an
XYZFactory implementation announces the implementation's existence, the
implementation of the XYZRegistry gets notified about this. More specifically,
the BundleActivator of the XYZRegistryImpl makes sure it gets notifications
from the OSGi platform about new XYZFactory implementations and passes this
information to the XYZRegistryImpl, so that the registry itself is still
completely non-OSGi-specific.

When you need an XYZ instance, you approach the XYZRegistry instance (of which
there is only one in the system) with the necessary details (e.g. a MIME type,
a scheme, a DataSource type, etc.) and it will provide you with an appropriate
XYZFactory implementation, if there is any available. This factory will then
provide you with the instance.
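
Taking Extractor as the XYZ, the lookup could be sketched as below, without
any OSGi involved. The method names (add, remove, get, getSupportedMimeTypes),
the simplified extract signature and the "first registered factory wins" rule
are assumptions for this example:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

interface Extractor {
    String extract(String raw); // heavily simplified
}

interface ExtractorFactory {
    Set<String> getSupportedMimeTypes();
    Extractor get();
}

interface ExtractorRegistry {
    void add(ExtractorFactory factory);
    void remove(ExtractorFactory factory);
    ExtractorFactory get(String mimeType); // null if none is available
}

class ExtractorRegistryImpl implements ExtractorRegistry {
    private final Map<String, List<ExtractorFactory>> byMimeType = new HashMap<>();

    public synchronized void add(ExtractorFactory factory) {
        for (String mimeType : factory.getSupportedMimeTypes()) {
            byMimeType.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(factory);
        }
    }

    public synchronized void remove(ExtractorFactory factory) {
        for (List<ExtractorFactory> factories : byMimeType.values()) {
            factories.remove(factory);
        }
    }

    // Simplest possible strategy: the first registered factory wins.
    public synchronized ExtractorFactory get(String mimeType) {
        List<ExtractorFactory> factories = byMimeType.get(mimeType);
        return (factories == null || factories.isEmpty()) ? null : factories.get(0);
    }
}

// Stateless extractor, so the factory always hands out the same instance.
class PlainTextExtractorFactory implements ExtractorFactory {
    private static final Extractor INSTANCE = raw -> raw.trim();

    public Set<String> getSupportedMimeTypes() {
        return Collections.singleton("text/plain");
    }

    public Extractor get() { return INSTANCE; }
}
```

Client code then does something like
registry.get("text/plain").get().extract(text), after checking that the
registry did not return null.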

The XYZRegistryImpl is part of a separate bundle, it should not be provided
with the bundle containing XYZ and XYZRegistry, as there may be some
application-dependent decisions to make in its implementation. For example, a
DataCrawlerRegistryImpl could look at which OS platform the application is
running on and prefer an OS-specific DataCrawler (or actually: its
corresponding DataCrawlerFactory) over an OS-independent implementation,
assuming that OS-specific implementations provide better optimizations. In
different domains there may be different strategies for choosing a factory, so
this should not be part of the bundle that defines the XYZ and XYZRegistry
interfaces.
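
The OS preference mentioned above might look like this in a registry
implementation. The getTargetOS method, the Simple* factory and the substring
match on the OS name are all assumptions, one possible strategy among many:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

interface DataCrawlerFactory {
    // Hypothetical: e.g. "windows", or null for an OS-independent crawler.
    String getTargetOS();
}

class SimpleDataCrawlerFactory implements DataCrawlerFactory {
    private final String targetOS;
    SimpleDataCrawlerFactory(String targetOS) { this.targetOS = targetOS; }
    public String getTargetOS() { return targetOS; }
}

class OsAwareDataCrawlerRegistryImpl {
    private final List<DataCrawlerFactory> factories = new ArrayList<>();
    private final String osName;

    // A real application would pass System.getProperty("os.name") here.
    OsAwareDataCrawlerRegistryImpl(String osName) {
        this.osName = osName.toLowerCase(Locale.ROOT);
    }

    void add(DataCrawlerFactory factory) { factories.add(factory); }

    // Prefers a factory targeting the current OS and falls back to the
    // first OS-independent one. Returns null if neither exists.
    DataCrawlerFactory get() {
        DataCrawlerFactory fallback = null;
        for (DataCrawlerFactory factory : factories) {
            String target = factory.getTargetOS();
            if (target == null) {
                if (fallback == null) {
                    fallback = factory;
                }
            } else if (osName.contains(target.toLowerCase(Locale.ROOT))) {
                return factory;
            }
        }
        return fallback;
    }
}
```

Another application could swap in a registry implementation with a
completely different preference, without touching the factories themselves.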

As you can see in the implementations of our factories and registries, there
is actually no OSGi-specific code in them. The code that gets informed about
new factories and passes them on to the registries is part of the
OSGi-specific BundleActivators. This is the only location where use of OSGi is
assumed. It is always possible to directly instantiate an XYZRegistryImpl and
pass it a set of XYZFactory implementations, as you can see in the code
examples. However, then you have to make assumptions on which registry and
factory implementations are available. The BundleActivators automate this
process so that you don't have to embed this knowledge in your application.
E.g., add a new Extractor implementation bundle (a jar file!) to your system
and you can automatically handle the MIME types it supports. Not a single line
of existing code needs to be changed.
